
A Concept uniqueness and granularity

Neural Information Processing Systems

Here, we report statistics on the uniqueness of neuron concepts as the maximum formula length of our explanations increases. Figure S1: Number of repeated concepts across probed vision and NLI models, by maximum formula length. Table S1: For the probed image classification and NLI models, the average number of occurrences of each detected concept and the percentage of detected concepts that are unique (i.e., detected for exactly one unit). A.1 Image Classification. Figure S1 (left) plots the number of times each unique concept appears across the 512 units of ResNet-18 as the maximum formula length increases. Table S1 reports the mean number of occurrences per concept and the percentage of concepts that are unique (i.e., occur for exactly one unit).
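The mean-occurrence and percentage-unique statistics of Table S1 reduce to a small counting exercise. A sketch, assuming one detected concept formula per unit (the concept names below are placeholders, not the paper's actual labels):

```python
from collections import Counter

def concept_stats(concepts):
    """Mean number of occurrences per detected concept, and the
    percentage of detected concepts that are unique (occur once)."""
    counts = Counter(concepts)
    mean_occ = sum(counts.values()) / len(counts)
    pct_unique = 100.0 * sum(1 for c in counts.values() if c == 1) / len(counts)
    return mean_occ, pct_unique

# Toy example: 5 units, 4 distinct concepts, 3 of which occur only once.
print(concept_stats(["water", "water", "dog", "sky", "grass"]))  # (1.25, 75.0)
```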



Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

Kim, Changhun, Mun, Yechan, Jang, Hyeongwon, Lee, Eunseo, Hahn, Sangchul, Yang, Eunho

arXiv.org Artificial Intelligence

Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects, such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings and evaluation metrics. Our code is publicly available at https://anonymous.4open.science/r/Delta-XAI.
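As a rough sketch of the idea behind SWING (the paper's exact integration path may differ), the snippet below implements plain Integrated Gradients but uses the previous observation, rather than a zero vector, as the baseline, so the attribution explains the prediction change between two time steps. The model, weights, and observations are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint-rule approximation of the IG path integral from
    `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Illustrative one-layer model with an analytic gradient.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: sigmoid(w @ x)
grad_f = lambda x: f(x) * (1 - f(x)) * w

x_prev = np.array([0.1, 0.2, 0.3])  # previous observation, used as the baseline
x_now = np.array([0.4, 0.1, 0.9])   # current observation
attr = integrated_gradients(grad_f, x_now, x_prev)

# Completeness: the attributions sum to the prediction change between steps.
print(attr.sum(), f(x_now) - f(x_prev))
```

Because the baseline is an actual past observation, every point on the integration path stays close to the data rather than interpolating toward an all-zeros input, which is one way to mitigate the out-of-distribution effects the abstract mentions.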


The Directed Prediction Change - Efficient and Trustworthy Fidelity Assessment for Local Feature Attribution Methods

Iselborn, Kevin, Dembinsky, David, Lucieri, Adriano, Dengel, Andreas

arXiv.org Artificial Intelligence

The utility of an explanation method critically depends on its fidelity to the underlying machine learning model. Especially in high-stakes medical settings, clinicians and regulators require explanations that faithfully reflect the model's decision process. Existing fidelity metrics such as Infidelity rely on Monte Carlo approximation, which demands numerous model evaluations and introduces uncertainty due to random sampling. This work proposes a novel metric for evaluating the fidelity of local feature attribution methods by modifying the existing Prediction Change (PC) metric within the Guided Perturbation Experiment. By incorporating the direction of both perturbation and attribution, the proposed Directed Prediction Change (DPC) metric achieves an almost tenfold speedup and eliminates randomness, resulting in a deterministic and trustworthy evaluation procedure that measures the same property as local Infidelity. DPC is evaluated on two datasets (skin lesion images and financial tabular data), two black-box models, seven explanation algorithms, and a wide range of hyperparameters. Across $4\,744$ distinct explanations, the results demonstrate that DPC, together with PC, enables a holistic and computationally efficient evaluation of both baseline-oriented and local feature attribution methods, while providing deterministic and reproducible outcomes.
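The exact DPC formula is defined in the paper; as an illustration of the underlying guided-perturbation idea only, here is a minimal sketch in which each feature is nudged in the direction indicated by its attribution's sign and the resulting signed prediction change is recorded, using one deterministic model call per feature instead of Monte Carlo sampling:

```python
import numpy as np

def directed_prediction_change(f, x, attribution, eps=0.05):
    """Nudge each feature in the direction given by its attribution's
    sign and record the resulting signed change in the prediction --
    one deterministic model call per feature, no random sampling."""
    base = f(x)
    deltas = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] += eps * np.sign(attribution[i])
        deltas[i] = f(x_pert) - base
    return deltas

# Toy linear model with gradient-times-input attributions.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(w @ x)
x = np.array([0.3, 0.7, -0.2])
attr = w * x
print(directed_prediction_change(f, x, attr))
```

Determinism here comes for free: with no sampling, repeated runs yield identical scores, which is the reproducibility property the abstract emphasizes.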


Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition

Zhang, Lintong, Yin, Kang, Lee, Seong-Whan

arXiv.org Artificial Intelligence

Attribution-based explanation techniques capture key patterns to enhance visual interpretability; however, these patterns often lack the granularity needed for insight in fine-grained tasks, particularly in cases of model misclassification, where explanations may be insufficiently detailed. To address this limitation, we propose a fine-grained counterfactual explanation framework that generates both object-level and part-level interpretability, addressing two fundamental questions: (1) which fine-grained features contribute to model misclassification, and (2) where dominant local features influence counterfactual adjustments. Our approach yields explainable counterfactuals in a non-generative manner by quantifying similarity and weighting component contributions within regions of interest between correctly classified and misclassified samples. Furthermore, we introduce a saliency partition module grounded in Shapley value contributions, isolating features with region-specific relevance. Extensive experiments demonstrate the superiority of our approach in capturing more granular, intuitively meaningful regions, surpassing fine-grained methods.
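The saliency partition module is grounded in Shapley value contributions; for a small number of regions, exact Shapley values are computable by enumerating coalitions. A sketch with a toy set function standing in for the model's score over region coalitions (the paper's actual value function and partition scheme are not shown here):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values of set function v over n regions
    (enumeration; tractable only for small n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                S = set(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(S | {i}) - v(S))
    return phi

# Toy region-coalition score: region 0 matters on its own, while
# regions 1 and 2 only contribute jointly.
def score(S):
    return (0.6 if 0 in S else 0.0) + (0.4 if {1, 2} <= S else 0.0)

print(shapley_values(score, 3))  # [0.6, 0.2, 0.2] up to float error
```

Note the efficiency axiom: the three values sum to the grand-coalition score, so the partition attributes the whole prediction, and the joint interaction between regions 1 and 2 is split evenly between them.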


Controlled Model Debiasing through Minimal and Interpretable Updates

Di Gennaro, Federico, Laugel, Thibault, Grari, Vincent, Detyniecki, Marcin

arXiv.org Machine Learning

Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, generally without accounting for potentially existing previous models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between the new fair model and the existing one should be (i) interpretable and (ii) minimal. After providing theoretical guarantees for this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce (i) minimal and (ii) interpretable changes between biased and debiased predictions--a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in the fairness literature. Our approach combines a concept-based architecture with adversarial learning, and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while making minimal and interpretable prediction changes.

1 Introduction

The increasing adoption of machine learning models in high-stakes domains--such as criminal justice (Kleinberg et al., 2016) and credit lending (Bruckner, 2018)--has raised significant concerns about the potential biases that these models may reproduce and amplify, particularly against historically marginalized groups. Recent public discourse, along with regulatory developments such as the European AI Act (2024/1689), has further underscored the need for adapting AI systems to ensure fairness and trustworthiness (Bringas Colmenarejo et al., 2022).
Consequently, many of the machine learning models deployed by organizations are, or may soon be, subject to these emerging regulatory requirements. Yet, such organizations frequently invest significant resources in them.

The field of algorithmic fairness has experienced rapid growth in recent years, with numerous bias mitigation strategies proposed (Romei & Ruggieri, 2014; Mehrabi et al., 2021). These approaches can be broadly categorized into three types: pre-processing (e.g., (Belrose et al., 2024)), in-processing (e.g., (Zhang et al., 2018)), and post-processing (e.g., (Kamiran et al., 2010)), based on the stage of the machine learning pipeline at which fairness is enforced. While the former two categories do not account for any pre-existing biased model being available for the task, post-processing approaches aim to impose fairness by directly modifying the predictions of a biased classifier.
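The two desiderata can be made concrete with two simple quantities, sketched below for assumed binary predictions and a binary sensitive attribute: the fraction of predictions the debiased model flips (a proxy for minimality) and the demographic parity gap of the new predictions (one common fairness measure; COMMOD's actual objectives and architecture are not reproduced here):

```python
import numpy as np

def flip_rate(y_old, y_new):
    """Fraction of predictions the updated model changes (minimality)."""
    return float(np.mean(y_old != y_new))

def parity_gap(y_pred, sensitive):
    """Demographic parity gap: largest difference in positive-prediction
    rates across groups of the sensitive attribute."""
    rates = [float(y_pred[sensitive == g].mean()) for g in np.unique(sensitive)]
    return max(rates) - min(rates)

y_biased = np.array([1, 1, 0, 0, 1, 0])
y_debiased = np.array([1, 0, 0, 0, 1, 1])  # only two predictions flipped
group = np.array([0, 0, 0, 1, 1, 1])
print(flip_rate(y_biased, y_debiased), parity_gap(y_debiased, group))
```

Controlled debiasing, in this framing, asks for a new model that drives the parity gap down while keeping the flip rate (and the interpretability of the flips) under control.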


Early Stopping Against Label Noise Without Validation Data

Yuan, Suqin, Feng, Lei, Liu, Tongliang

arXiv.org Artificial Intelligence

Concretely, sparing more data from the training set for validation limits the performance of the learned model, yet insufficient validation data can result in a sub-optimal selection of the desired model. In this paper, we propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model in the presence of label noise. It works by tracking the changes in the model's predictions on the training set during the training process, aiming to halt training before the model unduly fits mislabeled data. This method is empirically supported by our observation that minimum fluctuations in predictions typically occur at the training epoch before the model excessively fits mislabeled data. Through extensive experiments, we show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels. Deep Neural Networks (DNNs) are praised for their remarkable expressive power, which allows them to uncover intricate patterns in high-dimensional data (Montufar et al., 2014; LeCun et al., 2015) and even to fit data with random labels. However, this strength, often termed Memorization (Zhang et al., 2017), can be a double-edged sword, especially when encountering label noise. When label noise exists, the inherent capability of DNNs might cause the model to fit mislabeled examples from noisy datasets, which can deteriorate its generalization performance. Specifically, when DNNs are trained on noisy datasets containing both clean and mislabeled examples, it is often observed that the test error initially decreases and subsequently increases.
To prevent DNNs from overconfidently learning from mislabeled examples, many existing methods for learning with noisy labels (Xia et al., 2019; Han et al., 2020; Song et al., 2022; Huang et al., 2023) explicitly or implicitly adopted the operation of halting training before the test error increases--a strategy termed "early stopping". Early stopping relies on model selection, aiming to choose a model that aligns most closely with the true concept from a range of candidate models obtained during the training process (Mohri et al., 2018; Bai et al., 2021). To this end, leveraging hold-out validation data to pinpoint an appropriate early stopping point for model selection becomes a prevalent approach (Xu & Goodacre, 2018) in deep learning. However, this approach heavily relies on additional validation data that is usually derived by splitting the training set, thereby resulting in degraded performance due to insufficient training data.
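A minimal sketch of the tracking idea (the paper's actual criterion may be more elaborate): count how many training-set predictions flip between consecutive epochs and stop once that count has stopped reaching new minima, with no validation split required. The prediction history and patience value below are illustrative:

```python
import numpy as np

def label_wave_stop(pred_history, patience=2):
    """Given per-epoch predictions on the training set, count how many
    predictions flip between consecutive epochs and report the epoch of
    minimum fluctuation, stopping once no new minimum has been reached
    for `patience` epochs."""
    best, best_epoch, waited = None, 0, 0
    for t in range(1, len(pred_history)):
        flips = int(np.sum(pred_history[t] != pred_history[t - 1]))
        if best is None or flips < best:
            best, best_epoch, waited = flips, t, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Toy run: fluctuations shrink while clean patterns are learned, then
# grow again as mislabeled examples start to be memorized.
history = [np.array(p) for p in ([1, 0, 0, 1], [1, 1, 0, 0], [1, 1, 0, 1],
                                 [1, 1, 0, 1], [0, 1, 0, 0], [0, 1, 1, 0])]
print(label_wave_stop(history))  # 3, the epoch of minimum fluctuation
```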


Reviews: On Robustness to Adversarial Examples and Polynomial Optimization

Neural Information Processing Systems

The paper studies the robustness of machine learning classifiers against adversarial examples. The authors use semidefinite programming (SDP) in order to obtain provable polynomial-time adversarial attacks against hypothesis classes that are degree-1 and degree-2 polynomial threshold functions. Furthermore, the authors show that learning polynomial threshold functions of degree 2 or higher in a robust sense (against adversarial examples) is computationally a hard problem. The authors also experiment with 2-layer neural networks, using an SDP method to either find adversarial examples or certify that none exist within a certain budget delta available to the adversary. First of all, there are parts of the paper that are very well written and other parts that I assume were rushed for the NeurIPS submission deadline.


Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing

Le-Cong, Thanh, Nguyen, Dat, Le, Bach, Murray, Toby

arXiv.org Artificial Intelligence

In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally-occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases on NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.


Feature Attribution with Necessity and Sufficiency via Dual-stage Perturbation Test for Causal Explanation

Chen, Xuexin, Cai, Ruichu, Huang, Zhengting, Zhu, Yuxuan, Horwood, Julien, Hao, Zhifeng, Li, Zijian, Hernandez-Lobato, Jose Miguel

arXiv.org Artificial Intelligence

We investigate the problem of explainability in machine learning. To address this problem, Feature Attribution Methods (FAMs) measure the contribution of each feature through a perturbation test, where the difference in prediction is compared under different perturbations. However, such perturbation tests may not accurately distinguish the contributions of different features when their change in prediction is the same after perturbation. In order to enhance the ability of FAMs to distinguish different features' contributions in this challenging setting, we propose to utilize the probability (PNS) that perturbing a feature is a necessary and sufficient cause for the prediction to change as a measure of feature importance. Our approach, Feature Attribution with Necessity and Sufficiency (FANS), computes the PNS via a perturbation test involving two stages (factual and interventional). In practice, to generate counterfactual samples, we use a resampling-based approach on the observed samples to approximate the required conditional distribution. Finally, we combine FANS with gradient-based optimization to extract the subset with the largest PNS. We demonstrate that FANS outperforms existing feature attribution methods on six benchmarks.
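As a toy illustration of scoring features by necessity and sufficiency (a heavy simplification of FANS's two-stage test, with Gaussian noise standing in for its resampling-based counterfactual generation): for each feature, draw a joint perturbation, then compare the prediction when the feature receives its share of the perturbation against the prediction when it is held at its factual value.

```python
import numpy as np

rng = np.random.default_rng(0)

def pns_scores(f, x, n=4000, scale=1.0):
    """For each feature i, draw a joint Gaussian perturbation and compare
    two samples: one where feature i is perturbed along with the rest,
    and one where it is held at its factual value.  Count draws where
    perturbing i flips the prediction while withholding its perturbation
    does not -- a crude stand-in for the PNS."""
    base = f(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        hits = 0
        for _ in range(n):
            eps = rng.normal(0.0, scale, size=len(x))
            perturbed = x + eps   # feature i perturbed too
            withheld = x + eps
            withheld[i] = x[i]    # feature i kept at its factual value
            hits += (f(perturbed) != base) and (f(withheld) == base)
        scores[i] = hits / n
    return scores

# Toy classifier that only looks at feature 0: only that feature should
# receive a nonzero score, however strongly feature 1 is perturbed.
f = lambda x: bool(x[0] > 0)
print(pns_scores(f, np.array([0.2, 5.0])))
```

A plain flip-probability test would also score feature 0 highest here; the paired factual/interventional comparison matters in the harder cases the abstract describes, where several features produce the same prediction change when perturbed alone.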